Abstract

This paper describes a temporal expression identification and
normalization system, ManTIME, developed for the TempEval-3
challenge. The identification phase combines conditional random
fields with a post-processing identification pipeline, whereas
the normalization phase is carried out using NorMA, an
open-source rule-based temporal normalizer. We investigate the
performance variation with respect to different feature types.
Specifically, we show that the use of WordNet-based features in
the identification task negatively affects the overall
performance, and that there is no statistically significant
difference in using gazetteers, shallow parsing and
propositional noun phrase labels on top of the morphological
features. On the test data, the best run achieved 0.95 (P),
0.85 (R) and 0.90 (F1) in the identification phase.
Normalization accuracies are 0.84 (type attribute) and 0.77
(value attribute). Surprisingly, the use of the silver data
(alone or in addition to the gold annotated ones) does not
improve the performance.

1 Introduction 



Temporal information extraction (Verhagen et al., 2007;
Verhagen et al., 2010) is pivotal for many Natural Language
Processing (NLP) applications such as question answering, text
summarization and machine translation. Recently the topic has
also attracted increasing interest in the medical domain (Sun
et al., 2013; Kovacevic et al., 2013).

Following the work of Ahn et al. (2005), the temporal
expression extraction task is now conventionally divided into
two main steps: identification and normalization. In the former
step, the effort is concentrated on how to detect the right
boundary of temporal expressions in the text. In the
normalization step, the aim is to interpret and represent the
temporal meaning of the expressions using the TimeML
(Pustejovsky et al., 2003) format. In the TempEval-3 challenge
(UzZaman et al., 2012) the normalization task is focused only
on two temporal attributes: type and value.

2 System architecture

ManTIME mainly consists of two components, one for the
identification and one for the normalization.

2.1 Identification

We tackled the problem of identification as a sequence labeling
task, leading to the choice of Linear Conditional Random Fields
(CRF) (Lafferty et al., 2001). We trained the system using both
human-annotated data (TimeBank and AQUAINT corpora) and silver
data (TE3Silver corpus) provided by the organizers of the
challenge in order to investigate the importance of the silver
data.

Because the silver data are far more numerous 
(660K tokens vs. 95K), our main goal was to rein- 
force the human-annotated data, under the assump- 
tion that they are more informative with respect to 
the training phase. Similarly to the approach proposed by
Adafre and de Rijke (2005), we developed a post-processing
pipeline on top of the CRF sequence labeler to boost the
results. Below we describe each component in detail.



2.1.1 Conditional Random Fields 

The success of applying CRFs mainly depends on 
three factors: the labeling scheme (BI, BIO, BIOE 
or BIOEU), the topology of the factor graph and 
the quality of the features used. We used the BIO 
format in all the experiments performed during this 
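research. The factor graph has been generated using the
following topology: $(w_{0})$, $(w_{-1})$, $(w_{-2})$,
$(w_{+1})$, $(w_{+2})$, $(w_{-2} \wedge w_{-1})$,
$(w_{-1} \wedge w_{0})$, $(w_{0} \wedge w_{+1})$,
$(w_{-1} \wedge w_{0} \wedge w_{+1})$,
$(w_{0} \wedge w_{+1} \wedge w_{+2})$, $(w_{+1} \wedge w_{+2})$,
$(w_{-2} \wedge w_{-1} \wedge w_{0})$, $(w_{-1} \wedge w_{+1})$
and $(w_{-2} \wedge w_{+2})$.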
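
Since the experiments rely on CRF++, a topology of this kind is usually written down as a feature template file. The following is a minimal sketch that generates unigram template lines for the 14 window patterns listed above. The paper does not show its actual template, so the feature column index and the `U00`-style identifiers are illustrative assumptions.

```python
# Token offsets, relative to the current token w0, for each factor
# in the topology described in the text.
TOPOLOGY = [
    (0,), (-1,), (-2,), (1,), (2,),
    (-2, -1), (-1, 0), (0, 1),
    (-1, 0, 1), (0, 1, 2), (1, 2),
    (-2, -1, 0), (-1, 1), (-2, 2),
]


def crfpp_template(topology, column=0):
    """Return CRF++ unigram template lines, one per factor.

    `column` is the feature column the pattern is applied to. This is
    an assumption: the paper does not say how the patterns are mapped
    onto its 94 feature columns.
    """
    lines = []
    for index, offsets in enumerate(topology):
        macros = "/".join(f"%x[{offset},{column}]" for offset in offsets)
        lines.append(f"U{index:02d}:{macros}")
    return lines


if __name__ == "__main__":
    print("\n".join(crfpp_template(TOPOLOGY)))
```

Each conjunction such as $(w_{-1} \wedge w_{0})$ becomes a single template line that joins the two `%x[row,column]` macros with a slash.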

The system tokenizes each document in the cor- 
pus and extracts 94 features. These belong to the 
following four disjoint categories: 

• Morphological: This set includes a comprehensive list of
features typical of Named Entity Recognition (NER) tasks,
such as the word as it is, lemma, stem, pattern (e.g.
'Jan-2003': 'Xxx-dddd'), collapsed pattern (e.g. 'Jan-2003':
'Xx-d'), first 3 characters, last 3 characters, upper first
character, presence of 's' as last character, word without
letters, word without letters or numbers, and verb tense. For
lemma and POS tags we use TreeTagger (Schmid, 1994). Boolean
values are included, indicating if the word is lower-case,
alphabetic, digit, alphanumeric, titled, capitalized, acronym
(capitalized with dots), number, decimal number, number with
dots or stop-word. Additionally, there are features
specifically crafted to handle temporal expressions in the
form of regular expression matching: cardinal and ordinal
numbers, times, dates, temporal periods (e.g. morning, noon,
nightfall), day of the week, seasons, past references (e.g.
ago, recent, before), present references (e.g. current, now),
future references (e.g. tomorrow, later, ahead), temporal
signals (e.g. since, during), fuzzy quantifiers (e.g. about,
few, some), modifiers, temporal adverbs (e.g. daily, earlier),
adjectives, conjunctions and prepositions.

• Syntactic: Chunks and propositional noun phrases belong to
this category. Both are extracted using the shallow parsing
software MBSP.



• Gazetteers: These features are expressed us- 
ing the BIO format because they can include 
expressions longer than one word. The inte- 
grated gazetteers are: male and female names, 
U.S. cities, nationalities, world festival names 
and ISO countries. 

• WordNet: For each word we use the number of senses
associated with the word, the names of the first and second
senses, the first 4 lemmas, the first 4 entailments for verbs,
the first 4 antonyms, the first 4 hypernyms and the first 4
hyponyms. Each of them is defined as a separate feature.
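
To make the pattern-style morphological features concrete, the sketch below reproduces the 'Jan-2003' example given in the list above. It is an illustration only: the function names and the dictionary keys are hypothetical, and it covers just a handful of the 94 features.

```python
import re


def pattern(word):
    """Map uppercase letters to 'X', lowercase to 'x' and digits to 'd'.

    Any other character (e.g. a hyphen) is kept unchanged.
    """
    symbols = []
    for char in word:
        if char.isupper():
            symbols.append("X")
        elif char.islower():
            symbols.append("x")
        elif char.isdigit():
            symbols.append("d")
        else:
            symbols.append(char)
    return "".join(symbols)


def collapsed_pattern(word):
    """Collapse every run of identical pattern symbols into one symbol."""
    return re.sub(r"(.)\1+", r"\1", pattern(word))


def morphological_features(word):
    """Return a small, illustrative subset of the morphological features."""
    return {
        "word": word,
        "pattern": pattern(word),
        "collapsed_pattern": collapsed_pattern(word),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "upper_first": word[:1].isupper(),
        "is_lower": word.islower(),
        "is_digit": word.isdigit(),
        "is_alnum": word.isalnum(),
    }


if __name__ == "__main__":
    print(morphological_features("Jan-2003"))
```

Note that this collapsing rule also merges repeated punctuation, which is an assumption; the paper does not specify that case.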

The features mentioned above have been com- 
bined in 4 different models: 

• Model 1: Morphological only 

• Model 2: Morphological + syntactic 

• Model 3: Morphological + gazetteers 

• Model 4: Morphological + gazetteers + Word- 
Net 

All the experiments have been carried out using CRF++ 0.57
with parameters C = 1, η = 0.0001 and L2 regularization.
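
As a rough illustration of how these settings map onto the CRF++ command line, the sketch below builds a `crf_learn` invocation with the cost parameter (`-c`), the termination criterion η (`-e`) and the L2-regularized algorithm (`-a CRF-L2`). The file names are hypothetical, and the exact command used by the authors is not given in the paper.

```python
import subprocess


def crf_learn_command(template, train_file, model_file, cost=1.0, eta=0.0001):
    """Build the argument list for CRF++'s crf_learn tool.

    -a CRF-L2 selects L2-regularized CRF training, -c sets the cost
    hyper-parameter C and -e sets the termination criterion eta.
    """
    return [
        "crf_learn",
        "-a", "CRF-L2",
        "-c", str(cost),
        "-e", str(eta),
        template, train_file, model_file,
    ]


def train(template, train_file, model_file):
    """Run crf_learn; requires CRF++ to be installed and on the PATH."""
    subprocess.run(crf_learn_command(template, train_file, model_file), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(crf_learn_command("template.txt", "train.data", "model.bin")))
```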

2.1.2 Model selection 

The model selection was performed over the 
entire training corpus. Silver data and human- 
annotated data were merged, shuffled at sentence- 
level (seed = 490) and split into two sets: 80% as 
cross-validation set and 20% as real-world test set. 
The cross-validation set was shuffled 5 times, and 
for each of these, the 10-fold cross validation tech- 
nique was applied. 
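
The splitting protocol can be sketched as follows, using only the Python standard library. This is an assumption-laden illustration: the paper does not state which random number generator was used, so seeding Python's `random` with 490 will not reproduce the authors' exact split, and the interleaved fold assignment is one of several reasonable choices.

```python
import random


def split_and_folds(sentences, seed=490, cv_fraction=0.8,
                    repetitions=5, folds=10):
    """Split sentences 80/20, then yield 5 x 10-fold train/held-out pairs."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)

    cut = int(len(shuffled) * cv_fraction)
    cv_set, test_set = shuffled[:cut], shuffled[cut:]

    def runs():
        for _ in range(repetitions):
            # Reshuffle the cross-validation set before each repetition.
            order = list(cv_set)
            rng.shuffle(order)
            for k in range(folds):
                held_out = order[k::folds]
                training = [s for i, s in enumerate(order) if i % folds != k]
                yield training, held_out

    return cv_set, test_set, runs()


if __name__ == "__main__":
    cv_set, test_set, runs = split_and_folds(range(1000))
    print(len(cv_set), len(test_set), sum(1 for _ in runs))
```

The 20% portion is never touched during cross-validation, so it can serve as the "real-world" test set described above.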

The analysis is statistically significant (p = 0.0054 with
ANOVA test) and provides two important outcomes: (i) the set of
WordNet features negatively affects the overall classification
performance, as suggested by Rigo et al. (2011). We believe
this is due to the sparseness of the labels: many tokens did
not have any associated WordNet sense. (ii) There is no
statistically significant difference among the first three
models, despite the presence of apparently important
information such as chunks, propositional noun phrases and
gazetteers. Figure 1 shows the box plots for each model.

Figure 1: Differences among models using 5x10-fold
cross-validation

MBSP: http://www.clips.ua.ac.be/software/mbsp-for-python
CRF++: https://code.google.com/p/crfpp/

In virtue of this analysis, we opted for the smallest 
feature set (Model 1) to prevent overfitting. 

In order to get a reliable estimation of the performance of
the selected model on real-world data, we trained it on the
entire cross-validation set and tested it against the
real-world test set. The results for all the models are shown
in the following table:



System    Pre.    Rec.    Fβ=1
Model 1   83.20   85.22   84.50
Model 2   83.57   85.12   84.33
Model 3   83.51   85.12   84.31
Model 4   83.15   84.44   83.79

Precision, Recall and Fβ=1 score are computed
using strict matching.
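
For readers unfamiliar with the two matching regimes, the sketch below computes precision, recall and Fβ=1 over temporal expression spans. Strict matching counts a prediction as correct only when its boundaries are identical to a gold span; lenient matching accepts any overlap. The span representation (token offsets, end exclusive) is an assumption, and the official TempEval-3 scorer may differ in its details.

```python
def overlaps(a, b):
    """True if two (start, end) spans with exclusive ends share a token."""
    return a[0] < b[1] and b[0] < a[1]


def precision_recall_f1(gold, predicted, lenient=False):
    """Score predicted spans against gold spans.

    gold, predicted: sets of (start, end) token spans.
    """
    def matched(span, others):
        if lenient:
            return any(overlaps(span, other) for other in others)
        return span in others

    correct_predictions = sum(matched(p, gold) for p in predicted)
    recovered_gold = sum(matched(g, predicted) for g in gold)

    precision = correct_predictions / len(predicted) if predicted else 0.0
    recall = recovered_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, 3), (5, 6)}
    predicted = {(0, 3), (5, 7), (9, 10)}
    print(precision_recall_f1(gold, predicted))
    print(precision_recall_f1(gold, predicted, lenient=True))
```

In the usage example, the span (5, 7) overshoots the gold span (5, 6) by one token, so it counts only under lenient matching.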

The models used for the challenge have been 
trained using the entire training set. 

2.1.3 Post-processing identification pipeline 

Although CRFs already provide reasonable per- 
formance, equally balanced in terms of precision 
and recall, we focused on boosting the baseline per- 
formance through a post-processing pipeline. For 
this purpose, we introduced 3 different modules. 

Probabilistic correction module averages the probabilities
from the trained CRF model with the ones extracted from
human-annotated data only. For each token, we extracted: (i)
the conditional probability for each label to be assigned (B, I
or O), and (ii) the prior probability of the labels in the
human-annotated data only. The two probabilities are averaged
for every label of each token. The list of tokens extracted
from the human-annotated data was restricted to those that
appeared within the span of temporal expressions at least
twice. In some cases this module changes the most likely label,
which improves recall, although its major advantage is making
the CRF predictions less strict.
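
A minimal sketch of this averaging step is given below. It assumes that per-token label priors are estimated by relative frequency over the human-annotated tokens and that tokens are compared case-insensitively; neither detail is stated in the paper, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

LABELS = ("B", "I", "O")


def estimate_priors(tagged_tokens, min_in_timex=2):
    """Estimate label priors per token from (token, label) pairs.

    Only tokens seen inside temporal expressions (label B or I) at
    least `min_in_timex` times are kept.
    """
    counts = defaultdict(Counter)
    for token, label in tagged_tokens:
        counts[token.lower()][label] += 1

    priors = {}
    for token, label_counts in counts.items():
        if label_counts["B"] + label_counts["I"] >= min_in_timex:
            total = sum(label_counts.values())
            priors[token] = {label: label_counts[label] / total for label in LABELS}
    return priors


def correct(marginals, token_priors):
    """Average the CRF marginals with the priors; pass through if no priors."""
    if token_priors is None:
        return dict(marginals)
    return {label: (marginals[label] + token_priors[label]) / 2 for label in LABELS}


def most_likely(distribution):
    """Return the label with the highest probability."""
    return max(distribution, key=distribution.get)


if __name__ == "__main__":
    data = [("ago", "I"), ("ago", "I"), ("ago", "O"), ("the", "O")]
    priors = estimate_priors(data)
    crf_marginals = {"B": 0.1, "I": 0.3, "O": 0.6}
    print(most_likely(correct(crf_marginals, priors.get("ago"))))
```

In the usage example the CRF alone would label "ago" as O, while the averaged distribution favours I, which is the recall-improving effect described above.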

BIO fixer fixes wrong label sequences. For the BIO labeling
scheme, the sequence O-I is necessarily wrong. We identified
B-I as the appropriate substitution. This is the case in which
the first token has been incorrectly annotated (e.g. "Three/O
days/I ago/I ./O" is converted into "Three/B days/I ago/I
./O"). We also merged adjacent expressions such as B-B or I-B,
because different temporal expressions are generally separated
by at least a symbol or a punctuation character (e.g.
"Wednesday/B morning/B" is converted into "Wednesday/B
morning/I").
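
The two repair rules can be sketched as a pair of passes over the label sequence. The order of the passes and the handling of a sequence that starts with I are assumptions, since the paper only describes the substitutions themselves.

```python
def fix_bio(labels):
    """Repair an invalid BIO label sequence.

    Pass 1: an I that follows an O means the preceding token was
            mislabelled, so that O becomes B (O-I -> B-I). An I at the
            very start of the sequence becomes B (assumption).
    Pass 2: a B that directly follows a B or an I is merged into the
            previous expression (B-B -> B-I, I-B -> I-I).
    """
    fixed = list(labels)

    for i, label in enumerate(fixed):
        if label == "I":
            if i == 0:
                fixed[i] = "B"
            elif fixed[i - 1] == "O":
                fixed[i - 1] = "B"

    for i in range(1, len(fixed)):
        if fixed[i] == "B" and fixed[i - 1] in ("B", "I"):
            fixed[i] = "I"

    return fixed


if __name__ == "__main__":
    # "Three/O days/I ago/I ./O"
    print(fix_bio(["O", "I", "I", "O"]))
    # "Wednesday/B morning/B"
    print(fix_bio(["B", "B"]))
```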

Threshold-based label switcher uses the probabilities
extracted from the human-annotated data. When the most likely
label (in the human-annotated data) has a prior probability
greater than a certain threshold, the module changes the label
predicted by the CRF to the most likely one. This enforces the
label distribution learned from the human-annotated data.
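
The switching rule is simple enough to state in a few lines. In the sketch below the default threshold of 0.87 is the value the paper reports as optimal; the function name and the representation of the priors as a dictionary are illustrative assumptions.

```python
def switch_label(predicted, token_priors, threshold=0.87):
    """Override the CRF label when the prior is confident enough.

    predicted:    label chosen by the CRF ('B', 'I' or 'O').
    token_priors: dict mapping each label to its prior probability in
                  the human-annotated data, or None if the token has
                  no priors.
    """
    if token_priors is None:
        return predicted
    best = max(token_priors, key=token_priors.get)
    if token_priors[best] > threshold:
        return best
    return predicted


if __name__ == "__main__":
    print(switch_label("O", {"B": 0.90, "I": 0.05, "O": 0.05}))
    print(switch_label("O", {"B": 0.60, "I": 0.10, "O": 0.30}))
```

Because the override can create sequences such as O-I, it is natural that the reported pipeline runs the BIO fixer again after this module.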

Through repeated empirical experiments on a 
small sub-set of the training data, we found an 
optimal threshold value (0.87) and an optimal se- 
quence of pipeline components (Probabilistic cor- 
rection module, BIO fixer, Threshold-based label 
switcher, BIO fixer). 

We analyzed the effectiveness of the post-processing
identification pipeline using a 10-fold cross-validation over
the 4 models. The difference between CRFs and CRFs +
post-processing pipeline is statistically significant
($p = 3.51 \times 10^{-23}$ with paired t-test) and the expected
average increment is 2.27% with respect to the strict Fβ=1
scores.

2.2 Normalization 

The normalization component is an updated version of NorMA
(Filannino, 2012), an open-source rule-based system.



                           Identification                             Normalization
                           Strict matching       Lenient matching     Accuracy       Overall
# run  Training data       Pre.   Rec.   Fβ=1    Pre.   Rec.   Fβ=1   Type   Value   score
       (post-processing)
1      Human&Silver (no)   78.57  63.77  70.40   97.32  78.99  87.20  88.99  77.06   67.20
2      Human&Silver (yes)  79.82  65.94  72.22   97.37  80.43  88.10  87.38  75.68   66.67
3      Human (no)          76.07  64.49  69.80   94.87  80.43  87.06  87.39  77.48   67.45
4      Human (yes)         78.86  70.29  74.33   95.12  84.78  89.66  86.31  76.92   68.97
5      Silver (no)         77.68  63.04  69.60   97.32  78.99  87.20  88.99  77.06   67.20
6      Silver (yes)        81.98  65.94  73.09   98.20  78.99  87.55  90.83  77.98   68.27

Table 1: Performance on the TempEval-3 test set.
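
The paper defines the overall score as the product of the lenient Fβ=1 score and the value accuracy. As a quick consistency check, the sketch below recomputes that product for each run from the figures in Table 1; the dictionary layout and function name are illustrative.

```python
# run number -> (lenient F, value accuracy, reported overall score),
# all taken from Table 1 and expressed as percentages.
RUNS = {
    1: (87.20, 77.06, 67.20),
    2: (88.10, 75.68, 66.67),
    3: (87.06, 77.48, 67.45),
    4: (89.66, 76.92, 68.97),
    5: (87.20, 77.06, 67.20),
    6: (87.55, 77.98, 68.27),
}


def overall_score(lenient_f, value_accuracy):
    """Multiply two percentages and return the result as a percentage."""
    return lenient_f * value_accuracy / 100


if __name__ == "__main__":
    for run, (lenient_f, value_accuracy, reported) in RUNS.items():
        computed = overall_score(lenient_f, value_accuracy)
        print(f"run {run}: computed {computed:.2f}, reported {reported:.2f}")
```

For run 4, for instance, 89.66 × 76.92 / 100 ≈ 68.97, which matches the reported overall score.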



3 Results and Discussion 



We submitted six runs as combinations of different training
sets and the use of the post-processing identification
pipeline. The results are shown in Table 1, where the overall
score is computed as the product of the lenient Fβ=1 score and
the value accuracy.

In all the runs, recall is lower than precision. This is an
indication of a moderate lexical difference between training
data and test data. The relatively low type accuracy testifies
to the normalizer's inability to recognize new lexical
patterns. Among the correctly typed temporal expressions, about
10% are still assigned an incorrect value. The normalization
task proved to be challenging.

Training the system on human-annotated data only, in
combination with the post-processing pipeline, provided the
best results, although not the highest normalization accuracy.
Surprisingly, the silver data do not improve the performance,
whether used alone or in addition to human-annotated data
(regardless of the post-processing pipeline usage).

The post-processing pipeline produces the highest precision
when applied to the silver data only. In this case, the
pipeline acts as a reinforcement of the human-annotated data.
As expected, the post-processing pipeline boosts both precision
and recall. We registered the best improvement with the
human-annotated data.

Due to the small number of temporal expressions in the test
set (138), further analysis is required to draw more general
conclusions.

4 Conclusions

We described the overall architecture of ManTIME, a temporal
expression extraction pipeline, in the context of the
TempEval-3 challenge.

This research shows, within the limits of its generality, the
primary importance of morphological features over syntactic
features, as well as gazetteer and WordNet-related ones. In
particular, while syntactic and gazetteer-related features do
not affect the performance, WordNet-related features affect it
negatively.

The research also shows the use of a post-processing
identification pipeline to be promising for both precision and
recall enhancement.

Finally, we found that the silver data do not improve the
performance, although we consider the test set too small for
this result to be generalizable.

To aid replicability of this work, the system code, machine
learning pre-trained models, statistical validation details and
an online demo are available at:
http://www.cs.man.ac.uk/~filannim/projects/tempeval-3/